Self Paced Deep Learning for Weakly Supervised Object Detection
In a weakly-supervised scenario, object detectors need to be trained using
image-level annotations alone. Since bounding-box-level ground truth is not
available, most of the solutions proposed so far are based on an iterative,
Multiple Instance Learning framework in which the current classifier is used to
select the highest-confidence boxes in each image, which are treated as
pseudo-ground truth in the next training iteration. However, the errors of an
immature classifier can make the process drift, usually introducing many
false positives into the training dataset. To alleviate this problem, we propose
in this paper a training protocol based on the self-paced learning paradigm.
The main idea is to iteratively select a subset of images and boxes that are
the most reliable, and use them for training. While in the past few years
similar strategies have been adopted for SVMs and other classifiers, we are the
first to show that a self-paced approach can be used with deep-network-based
classifiers in an end-to-end training pipeline. The method we propose is built
on the fully-supervised Fast-RCNN architecture and can be applied to similar
architectures which represent the input image as a bag of boxes. We show
state-of-the-art results on Pascal VOC 2007, Pascal VOC 2010 and ILSVRC 2013.
On ILSVRC 2013 our results based on a low-capacity AlexNet network outperform
even those weakly-supervised approaches which are based on much higher-capacity
networks.
Comment: To appear in IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI).
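The iterative selection step described above can be illustrated with a minimal sketch: at each training round, only the highest-confidence pseudo-ground-truth boxes are kept, and the kept fraction grows over iterations. The function name and the flat score/box arrays are assumptions for illustration, not the paper's API.

```python
import numpy as np

def select_reliable(scores, boxes, fraction):
    """Self-paced selection sketch: keep only the top-`fraction` most
    confident pseudo-ground-truth boxes; later iterations pass a larger
    `fraction`, gradually admitting harder examples."""
    order = np.argsort(scores)[::-1]                # most confident first
    k = max(1, int(round(fraction * len(scores))))  # curriculum size
    keep = order[:k]
    return boxes[keep], scores[keep]

# Example: start training on the easiest 40% of candidate boxes.
scores = np.array([0.9, 0.2, 0.7, 0.4, 0.95])
boxes = np.arange(5)
kept_boxes, kept_scores = select_reliable(scores, boxes, 0.4)
```

In a full pipeline this selection would be re-run after every retraining round, so the curriculum and the classifier improve together.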
Deformable GANs for Pose-based Human Image Generation
In this paper we address the problem of generating person images conditioned
on a given pose. Specifically, given an image of a person and a target pose, we
synthesize a new image of that person in the novel pose. In order to deal with
pixel-to-pixel misalignments caused by the pose differences, we introduce
deformable skip connections in the generator of our Generative Adversarial
Network. Moreover, a nearest-neighbour loss is proposed instead of the common
L1 and L2 losses in order to match the details of the generated image with the
target image. We test our approach using photos of persons in different poses
and we compare our method with previous work in this area, showing
state-of-the-art results on two benchmarks. Our method can be applied to the
wider field of deformable object generation, provided that the pose of the
articulated object can be extracted using a keypoint detector.
Comment: CVPR 2018 version.
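The nearest-neighbour loss mentioned above can be sketched as follows: each generated pixel is compared against the best-matching target pixel inside a small neighbourhood, which tolerates the residual misalignments that a plain per-pixel L1 loss penalizes. This is a simplified single-channel stand-in, not the authors' exact formulation.

```python
import numpy as np

def nn_loss(gen, tgt, radius=1):
    """Nearest-neighbour loss sketch: for each pixel of `gen`, take the
    minimum absolute difference against any `tgt` pixel within a
    (2*radius+1)^2 window, then average over the image."""
    h, w = gen.shape
    pad = np.pad(tgt, radius, mode='edge')
    best = np.full((h, w), np.inf)
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            shifted = pad[dy:dy + h, dx:dx + w]
            best = np.minimum(best, np.abs(gen - shifted))
    return best.mean()

t = np.arange(9, dtype=float).reshape(3, 3)
loss_same = nn_loss(t, t)  # identical images give zero loss
```

Because the minimum is taken over a window, an image shifted by one pixel incurs a much smaller penalty than under plain L1, which is the behaviour wanted when poses do not align pixel-to-pixel.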
StylerDALLE: Language-Guided Style Transfer Using a Vector-Quantized Tokenizer of a Large-Scale Generative Model
Despite the progress made in the style transfer task, most previous work
focuses on transferring only relatively simple features like color or texture,
while missing more abstract concepts such as overall art expression or
painter-specific traits. However, these abstract semantics can be captured by
models like DALL-E or CLIP, which have been trained using huge datasets of
images and textual documents. In this paper, we propose StylerDALLE, a style
transfer method that exploits both of these models and uses natural language to
describe abstract art styles. Specifically, we formulate the language-guided
style transfer task as a non-autoregressive token sequence translation, i.e.,
from input content image to output stylized image, in the discrete latent space
of a large-scale pretrained vector-quantized tokenizer, e.g., the discrete
variational auto-encoder (dVAE) of DALL-E. To incorporate style information, we
propose a Reinforcement Learning strategy with CLIP-based language supervision
that ensures stylization and content preservation simultaneously. Experimental
results demonstrate the superiority of our method, which can effectively
transfer art styles using language instructions at different granularities.
Code is available at https://github.com/zipengxuc/StylerDALLE.
Comment: ICCV 2023.
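The discrete latent space underlying the method can be illustrated with a toy vector quantizer: each feature vector is replaced by the index of its nearest codebook entry, turning an image representation into a token sequence that a translation model can rewrite. The random-access codebook here is a hypothetical stand-in; DALL-E's dVAE codebook is learned.

```python
import numpy as np

def quantize(features, codebook):
    """Vector-quantization sketch (dVAE-style tokenizer): map each (D,)
    feature row to the id of its nearest (squared-L2) codebook entry."""
    # (N, D) features vs (K, D) codebook -> (N, K) distances -> (N,) ids
    d2 = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d2.argmin(axis=1)

codebook = np.eye(3)  # toy 3-entry codebook of 3-d vectors
feats = np.array([[0.9, 0.1, 0.0],
                  [0.0, 0.2, 1.1]])
tokens = quantize(feats, codebook)
```

Style transfer then amounts to predicting a new token sequence (non-autoregressively, in the paper) and decoding it back to pixels with the tokenizer's decoder.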
Deep Metric and Hash-Code Learning for Content-Based Retrieval of Remote Sensing Images
© 2018 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
The growing volume of Remote Sensing (RS) image archives demands feature learning techniques and hashing functions which can: (1) accurately represent the semantics in the RS images; and (2) offer quasi real-time performance during retrieval. This paper aims to address both challenges at the same time, by learning a semantic-based metric space for content-based RS image retrieval while simultaneously producing binary hash codes for an efficient archive search. This double goal is achieved by training a deep network using a combination of different loss functions which, on the one hand, aim at clustering semantically similar samples (i.e., images) and, on the other hand, encourage the network to produce final activation values (i.e., descriptors) that can be easily binarized. Moreover, since annotated RS training images are too few to train a deep network from scratch, we propose to split the image representation problem into two phases. In the first, we use a general-purpose, pre-trained network to produce an intermediate representation; in the second, we train our hashing network using a relatively small set of training images. Experiments on two aerial benchmark archives show that the proposed method outperforms previous state-of-the-art hashing approaches by up to 5.4% using the same number of hash bits per image.
EC/H2020/759764/EU/Accurate and Scalable Processing of Big Data in Earth Observation/BigEarth
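The two loss ingredients described in the abstract can be sketched together: a metric term that pulls same-class descriptors close, and a binarization term that pushes every activation towards ±1 so that thresholding at zero yields the hash bits. This is a simplified stand-in for the paper's exact loss combination; function and variable names are assumptions.

```python
import numpy as np

def hashing_losses(descriptors, labels):
    """Sketch of the combined objective: a clustering (metric) term over
    same-label pairs plus a binarization term |(|z| - 1)| averaged over
    all activations."""
    metric, pairs = 0.0, 0
    for i in range(len(labels)):
        for j in range(i + 1, len(labels)):
            if labels[i] == labels[j]:
                metric += np.square(descriptors[i] - descriptors[j]).sum()
                pairs += 1
    metric = metric / max(pairs, 1)
    binarization = np.abs(np.abs(descriptors) - 1.0).mean()
    return metric, binarization

def to_hash(descriptors):
    """Threshold at zero: near-binary descriptors become hash bits."""
    return (descriptors > 0).astype(np.uint8)

d = np.array([[1.0, -1.0], [1.0, -1.0]])
m, b = hashing_losses(d, labels=[0, 0])
```

When both terms are zero, semantically similar images already share identical ±1 descriptors, so the binary codes are free of quantization error.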
SpectralCLIP: Preventing Artifacts in Text-Guided Style Transfer from a Spectral Perspective
Owing to the power of vision-language foundation models, e.g., CLIP, the area
of image synthesis has seen recent important advances. Particularly, for style
transfer, CLIP enables transferring more general and abstract styles without
collecting the style images in advance, as the style can be efficiently
described with natural language, and the result is optimized by maximizing the
CLIP similarity between the text description and the stylized image. However,
directly using CLIP to guide style transfer leads to undesirable artifacts
(mainly written words and unrelated visual entities) spread over the image. In
this paper, we propose SpectralCLIP, which is based on a spectral
representation of the CLIP embedding sequence, where most of the common
artifacts occupy specific frequencies. By masking the band including these
frequencies, we can condition the generation process to adhere to the target
style properties (e.g., color, texture, paint stroke, etc.) while excluding the
generation of larger-scale structures corresponding to the artifacts.
Experimental results show that SpectralCLIP prevents the generation of
artifacts effectively in quantitative and qualitative terms, without impairing
the stylisation quality. We also apply SpectralCLIP to text-conditioned image
generation and show that it prevents written words in the generated images. Our
code will be publicly available.
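The core spectral operation described above can be sketched in a few lines: transform the CLIP embedding sequence to the frequency domain along the token axis, zero out the band where artifact energy concentrates, and transform back. The band indices here are placeholders; the actual band is chosen empirically in the paper.

```python
import numpy as np

def mask_band(embeddings, lo, hi):
    """Spectral filtering sketch: FFT along the token axis of an (L, D)
    embedding sequence, suppress frequencies in [lo, hi), inverse-FFT."""
    spec = np.fft.rfft(embeddings, axis=0)  # (freqs, D)
    spec[lo:hi] = 0.0                       # zero the artifact band
    return np.fft.irfft(spec, n=embeddings.shape[0], axis=0)

e = np.ones((8, 4))            # constant sequence: energy only at freq 0
filtered = mask_band(e, 1, 4)  # masking higher bands leaves it unchanged
```

Masking a band that excludes the DC component preserves the overall (low-frequency) style signal while removing the mid-frequency components associated with written-word artifacts.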
Simulation of (gossip) infotainment in the rhetoric of celebrity image (disgrace)
We present a generalization of the person-image generation task, in which a human image is generated conditioned on a target pose and a set X of source appearance images. In this way, we can exploit multiple, possibly complementary images of the same person, which are usually available at training and at testing time. The solution we propose is mainly based on a local attention mechanism which selects relevant information from different source image regions, avoiding the need to build a specific generator for each cardinality of X. The empirical evaluation of our method shows the practical interest of addressing the person-image generation problem in a multi-source setting.
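The cardinality-independence of the attention mechanism described above can be sketched as follows: for every target location, the output is a softmax-weighted combination over however many source images are available, so the same module handles any size of X. Dot-product attention is an illustrative stand-in here, not the paper's exact operator.

```python
import numpy as np

def attend_sources(queries, source_feats):
    """Local-attention sketch: per-location softmax over S source images.
    queries: (L, D) target features; source_feats: (S, L, D)."""
    scores = np.einsum('ld,sld->sl', queries, source_feats)  # (S, L)
    weights = np.exp(scores - scores.max(0))
    weights /= weights.sum(0)                                # softmax over S
    return np.einsum('sl,sld->ld', weights, source_feats)    # (L, D)

q = np.ones((2, 3))
single_src = np.arange(6, dtype=float).reshape(1, 2, 3)
out = attend_sources(q, single_src)  # with S=1 the source passes through
```

Because the softmax is taken over the source axis, adding or removing source images changes only the weighting, never the module's shape or parameters.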